This project aims to answer key questions about company performance using LinkedIn job postings and stock prices.
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(lubridate)
library(tidytext)
library(plotly)
# Load data
dat <- read_csv("../../temp_datalab_records_linkedin_company.csv")
# Normalize ampersands in industry names: "&" and the HTML entity "&amp;" both become "and"
dat <- dat %>%
  mutate(
    industry = str_replace_all(industry, "&", "and"),
    industry = str_replace_all(industry, "amp;", "")
  )
# table(dat$industry)
The first image shows the industries that appear to be hiring most heavily relative to their size over the data period. The industry average is computed by taking, for each company in an industry, the ratio of total job postings to average employee count, then averaging that ratio across the industry. As the title shows, Writing and Editing has the highest ratio from 2015 to 2019, at roughly 39 postings per employee. Very small start-ups (fewer than 10 employees) were excluded from this analysis. Further analysis could link this data with industry stock indices to understand trends in hiring practices.
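The ratio-then-average calculation described above can be sketched on a small toy table. This is a minimal illustration only: the company names, the `postings` and `avg_employees` columns, and all values are hypothetical and do not come from the dataset.

```r
library(dplyr)

# Toy data (hypothetical values): one row per company
toy <- tibble(
  company_name  = c("A", "B", "C", "D"),
  industry      = c("Sports", "Sports", "Tobacco", "Tobacco"),
  postings      = c(100, 300, 50, 150),  # total job postings per company
  avg_employees = c(50, 100, 25, 5)      # average employee count per company
)

toy_ratio <- toy %>%
  filter(avg_employees >= 10) %>%               # drop very small companies (D)
  mutate(ratio = postings / avg_employees) %>%  # company-level ratio
  group_by(industry) %>%
  summarize(mean_ratio = mean(ratio), .groups = "drop") %>%  # industry average
  arrange(desc(mean_ratio))

print(toy_ratio)
```

In this sketch each company contributes exactly one ratio to its industry average. The analysis code averages over the rows of `dat` rather than over one row per company, so companies with more records may carry more weight there.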
# Which industries are best represented in the dataset?
## 1. Basic: Display count by industry
dat %>%
count(industry, sort = T, name = "industry_total")
## # A tibble: 124 x 2
## industry industry_total
## <chr> <int>
## 1 Banking 168364
## 2 Biotechnology 152710
## 3 Financial Services 148143
## 4 Oil and Energy 123108
## 5 Retail 95384
## 6 Pharmaceuticals 92107
## 7 Information Technology and Services 85066
## 8 Computer Software 83214
## 9 Real Estate 81195
## 10 Internet 75450
## # … with 114 more rows
# Add count to dataset
dat <- dat %>%
add_count(industry, name = "industry_total") %>%
add_count(company_name, name = "company_total") %>%
add_count(company_name, industry, name = "industry_company_total")
## 2. Proportional by # of employees
dat <- dat %>%
  group_by(company_name) %>%
  mutate(
    avg_employee_ct = mean(employees_on_platform, na.rm = TRUE)
  ) %>%
  ungroup()
summary(dat$employees_on_platform)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 218 1083 7587 4513 577952
# Exclude very small start-ups (fewer than 10 employees on average)
# Surprising number of jobs compared to employees
surprise <- dat %>%
  filter(avg_employee_ct >= 10) %>%
  # Row-wise ratio of company postings to average employees (no grouping needed)
  mutate(
    ind_comp_emp_ratio = industry_company_total / avg_employee_ct
  ) %>%
  group_by(industry) %>%
  summarize(
    mean_ratio = mean(ind_comp_emp_ratio, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(desc(mean_ratio))
head(surprise)
## # A tibble: 6 x 2
## industry mean_ratio
## <chr> <dbl>
## 1 Writing and Editing 39.4
## 2 Recreational Facilities and Services 13.2
## 3 Venture Capital and Private Equity 12.2
## 4 Sports 11.9
## 5 Nanotechnology 10.3
## 6 Tobacco 10.1
# Writing and Editing has by far the highest ratio
# Display
surprise %>%
filter(mean_ratio >7) %>%
ggplot() +
geom_point(aes(mean_ratio, fct_reorder(industry, mean_ratio))) +
theme_classic() +
xlab("Ratio of Jobs to Employees") +
theme(axis.title.y = element_blank()) +
ggtitle(label = "Writing and Editing is hiring!")
ggsave("images/industry_hiring.png", dpi = 300, width = 7, height = 5)
The second image shows apparel companies with more than 150 employees that experienced hiring spikes not matched by industry-wide hiring. The plot shows, as a percentage, the ratio of each company's deviation from its own average job postings to the industry's deviation from its average in the same month. This is useful for spotting moments when a company appears to be doing better than average. Stock prices and news reports would help explain why these changes occurred when they did. The same analysis could be run with other filters, for example to find companies that have been doing poorly. It may also be useful as an interactive R Shiny application for further exploration.
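The deviation ratio behind this plot can be illustrated on toy monthly counts. This is a sketch under stated assumptions: the column names and values are hypothetical, and the `NA` guard for months where the industry does not deviate from its average is an addition for illustration, not part of the analysis code (in R, an unguarded division by zero yields `Inf`, `-Inf`, or `NaN`).

```r
library(dplyr)

# Toy monthly counts (hypothetical values) for one company and its industry
toy <- tibble(
  month          = 1:4,
  company_count  = c(10, 10, 30, 10),    # company postings per month
  industry_count = c(100, 90, 110, 100)  # industry postings per month
)

toy_dev <- toy %>%
  mutate(
    com_dev = company_count - mean(company_count),    # company deviation from its own average
    ind_dev = industry_count - mean(industry_count),  # industry deviation from its average
    # Ratio of the two deviations, in percent; NA when the industry did not deviate
    pctdiff = if_else(ind_dev == 0, NA_real_, com_dev / ind_dev * 100)
  )

print(toy_dev)
```

In month 3 the company is 15 postings above its average while the industry is 10 above its own, so the ratio is 150%. Months 1 and 4 have no industry deviation and are left as `NA` rather than infinite.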
The current proposal uses only LinkedIn job postings; the next phase will cross-match them with NYSE stock prices. The code follows.
# Companies that hire at different times than the rest of their industry
dat <- dat %>%
  # Postings per company and industry in each year-month
  add_count(industry, company_name, year(date_added), month(date_added), name = "company_industry_time") %>%
  # Postings per industry in each year-month
  add_count(industry, year(date_added), month(date_added), name = "industry_time") %>%
  # Compare each month's deviation from the company and industry averages
  group_by(company_name, industry) %>%
  mutate(
    ind_avg = mean(industry_time, na.rm = TRUE),
    com_avg = mean(company_industry_time, na.rm = TRUE),
    # Ratio of company deviation to industry deviation, in percent;
    # the result is Inf or NaN when industry_time equals ind_avg
    pctdiff_from_avg = (company_industry_time - com_avg) / (industry_time - ind_avg) * 100
  ) %>%
  ungroup()
p<- dat %>%
filter(industry == "Apparel and Fashion" &
pctdiff_from_avg > 25 &
avg_employee_ct > 150) %>%
ggplot() +
geom_point(aes(x = date_added, y= pctdiff_from_avg, color = company_name)) +
theme_classic() +
ggtitle("Apparel companies that have hiring spikes",
subtitle = "Compared to industry average at that time.") +
ylab("% diff from avg") +
theme(
axis.title.x = element_blank(),
legend.title = element_blank()
)
ggsave(plot = p, filename="images/apparel_companies.png", dpi = 300, width = 7, height = 5)
q <- ggplotly(p)
q
# Save
htmlwidgets::saveWidget(as_widget(q), "apparel_companies.html")